Python: Data Management

How to serialize and de-serialize data in Python. There are many options; we cover the ones that are currently popular and important in data science.

Pickle

This is probably the default serialization method used by most Python developers. Its main disadvantage is that it is Python-specific, and cannot be easily loaded in other languages. However, it is convenient if your project is Python only.

In [1]:
import pickle
In [2]:
bulbasaur = {
    "id": 1,
    "name": "bulbasaur",
    "base_experience": 64,
    "height": 7,
    "is_default": True,
    "order": 1,
    "weight": 69,
    "abilities": [
        {
            "is_hidden": True,
            "slot": 3,
            "ability": {
                "name": "chlorophyll",
                "url": "http://pokeapi.co/api/v2/ability/34/"
            }
        }
    ]
}

Pickle protocols

Objects can be pickled using several protocols (0 through 4 in Python 3.4–3.7; protocol 5 was added in Python 3.8). In general, use pickle.HIGHEST_PROTOCOL, as it is the most flexible and supports very large objects, unless you need to share data with Python 2, in which case use protocol 2.
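
As a quick sketch (exact byte counts vary by Python version and object), the same object can be serialized under each available protocol and the resulting sizes compared; pickle.loads detects the protocol automatically regardless of which one was used to write:

```python
import pickle

obj = {"name": "bulbasaur", "id": 1, "abilities": ["overgrow", "chlorophyll"]}

# Serialize under every protocol this interpreter supports (0 .. HIGHEST_PROTOCOL).
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    size = len(pickle.dumps(obj, protocol))
    print(f"protocol {protocol}: {size} bytes")

# No matter which protocol wrote the bytes, loads figures it out on its own.
assert pickle.loads(pickle.dumps(obj, 0)) == obj
```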

Serialize

In [3]:
with open('data/bulbasaur.pickle', 'wb') as f:
    pickle.dump(bulbasaur, f, pickle.HIGHEST_PROTOCOL)

De-serialize

In [4]:
with open('data/bulbasaur.pickle', 'rb') as f:
    pokemon = pickle.load(f)
In [5]:
pokemon
Out[5]:
{'abilities': [{'ability': {'name': 'chlorophyll',
    'url': 'http://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'height': 7,
 'id': 1,
 'is_default': True,
 'name': 'bulbasaur',
 'order': 1,
 'weight': 69}

Serialize to byte string

In [6]:
s = pickle.dumps(bulbasaur, pickle.HIGHEST_PROTOCOL)
In [7]:
s
Out[7]:
b'\x80\x04\x95\xd5\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x02id\x94K\x01\x8c\x04name\x94\x8c\tbulbasaur\x94\x8c\x0fbase_experience\x94K@\x8c\x06height\x94K\x07\x8c\nis_default\x94\x88\x8c\x05order\x94K\x01\x8c\x06weight\x94KE\x8c\tabilities\x94]\x94}\x94(\x8c\tis_hidden\x94\x88\x8c\x04slot\x94K\x03\x8c\x07ability\x94}\x94(h\x02\x8c\x0bchlorophyll\x94\x8c\x03url\x94\x8c$http://pokeapi.co/api/v2/ability/34/\x94uuau.'

De-serialize from byte string

In [8]:
pokemon2 = pickle.loads(s)
In [9]:
pokemon2
Out[9]:
{'abilities': [{'ability': {'name': 'chlorophyll',
    'url': 'http://pokeapi.co/api/v2/ability/34/'},
   'is_hidden': True,
   'slot': 3}],
 'base_experience': 64,
 'height': 7,
 'id': 1,
 'is_default': True,
 'name': 'bulbasaur',
 'order': 1,
 'weight': 69}
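
Pickle is not limited to built-in containers: instances of user-defined classes round-trip as well, provided the class definition is importable at load time. A minimal sketch with a hypothetical Pokemon class:

```python
import pickle

class Pokemon:
    """Hypothetical class for illustration; any importable class works."""
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __eq__(self, other):
        return (self.name, self.weight) == (other.name, other.weight)

p = Pokemon("bulbasaur", 69)
restored = pickle.loads(pickle.dumps(p, pickle.HIGHEST_PROTOCOL))
assert restored == p
```

Note that pickle records only the class's module and qualified name, not its code, so the same class must be importable when de-serializing. Also, never unpickle data from an untrusted source: loading a pickle can execute arbitrary code.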

Feather

Feather is a highly optimized binary serialization format for columnar tabular data that is useful for loading and saving large data frames. It can also be used to share large data frames among Python, R, and Julia.

Installation in Python

pip3 install feather-format

Installation in R

install.packages("feather")
In [10]:
from pandas_datareader import data
import arrow
import feather

Download data from Google Finance

In [11]:
start = arrow.get('2010-01-01')
end = arrow.get('2016-12-31')
tickers = ['AAPL', 'MSFT', 'SPY']
data_source = 'google'
panel = data.DataReader(tickers, data_source, start.datetime, end.datetime)
In [12]:
panel.keys()
Out[12]:
Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')

Format closing prices

In [13]:
close = panel.loc['Close']
close = close.reset_index()
close.head()
Out[13]:
Date AAPL MSFT SPY
0 2010-01-04 30.57 30.95 113.33
1 2010-01-05 30.63 30.96 113.63
2 2010-01-06 30.14 30.77 113.71
3 2010-01-07 30.08 30.45 114.19
4 2010-01-08 30.28 30.66 114.57

Serialize

In [14]:
feather.write_dataframe(close, 'data/close.feather')

De-serialize

In [15]:
close2 = feather.read_dataframe('data/close.feather')
close2.head()
Out[15]:
Date AAPL MSFT SPY
0 2010-01-04 30.57 30.95 113.33
1 2010-01-05 30.63 30.96 113.63
2 2010-01-06 30.14 30.77 113.71
3 2010-01-07 30.08 30.45 114.19
4 2010-01-08 30.28 30.66 114.57

Sharing data frames between R and Python

In [16]:
%load_ext rpy2.ipython
In [17]:
%%R

library(feather)
close <- read_feather('data/close.feather')
head(close)
# A tibble: 6 x 4
                 Date  AAPL  MSFT    SPY
               <dttm> <dbl> <dbl>  <dbl>
1 2010-01-03 19:00:00 30.57 30.95 113.33
2 2010-01-04 19:00:00 30.63 30.96 113.63
3 2010-01-05 19:00:00 30.14 30.77 113.71
4 2010-01-06 19:00:00 30.08 30.45 114.19
5 2010-01-07 19:00:00 30.28 30.66 114.57
6 2010-01-10 19:00:00 30.02 30.27 114.73

In [18]:
%%R

write_feather(close, 'data/closeR.feather')
In [19]:
close3 = feather.read_dataframe('data/closeR.feather')
close3.head()
Out[19]:
Date AAPL MSFT SPY
0 2010-01-04 30.57 30.95 113.33
1 2010-01-05 30.63 30.96 113.63
2 2010-01-06 30.14 30.77 113.71
3 2010-01-07 30.08 30.45 114.19
4 2010-01-08 30.28 30.66 114.57

HDF5

HDF5 is used to store and retrieve large quantities of hierarchically organized numeric data. It can be annotated to be “self-documenting” and makes efficient use of memory. It is a standard format, so can be transferred across language platforms.

We show a simple example of saving simulation data.

Saving to disk

In [20]:
import h5py
import arrow
import numpy as np
In [21]:
with h5py.File('data/sim.hdf5', 'w') as root:
    root.attrs['author'] = 'Cliburn'
    root.attrs['description'] = 'Simulated data sets for BIOS-821 demo'

    for i in range(4):
        g = root.create_group(f'expt{i:03}')
        g.attrs['created'] = arrow.utcnow().format()
        for j in range(5):
            d = g.create_dataset(f'sim{j:03}', data=np.random.random((1000,1000)))
            d.attrs['modified'] = arrow.utcnow().format()
In [22]:
root = h5py.File('data/sim.hdf5', 'r')
In [23]:
list(root.attrs.keys())
Out[23]:
['author', 'description']
In [24]:
for k, v in root.attrs.items():
    print(f'{k:12}: {v}')
author      : Cliburn
description : Simulated data sets for BIOS-821 demo
In [25]:
def show(name):
    print(root[name])
In [26]:
root.visit(show)
<HDF5 group "/expt000" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt001" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt002" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">
<HDF5 group "/expt003" (5 members)>
<HDF5 dataset "sim000": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim001": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim002": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim003": shape (1000, 1000), type "<f8">
<HDF5 dataset "sim004": shape (1000, 1000), type "<f8">

Access a specific data set (h5py reads lazily, so data is only pulled from disk when sliced)

In [27]:
data = root['expt002/sim003']
In [28]:
data.shape
Out[28]:
(1000, 1000)
In [29]:
list(data.attrs.keys())
Out[29]:
['modified']
In [30]:
data.attrs['modified']
Out[30]:
'2017-07-02 18:15:24+00:00'

Close the file handle

In [31]:
root.close()

JSON

JSON (JavaScript Object Notation) is a very simple plain text standard that is now ubiquitously used for data transfer, especially over the web via Representational State Transfer (REST) APIs.

JSON consists of objects (which resemble Python dictionaries, except that all keys must be strings) and arrays (which resemble Python lists), with values drawn from

  • strings - must be double quoted
  • numbers - integers or floats
  • boolean - true or false
  • null

Objects and arrays can be nested to arbitrary levels.

The json package in the standard library will convert JSON objects to Python dictionaries and JSON arrays to Python lists, and vice-versa.
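
The type mapping can be seen in a small round trip: Python's True becomes JSON's true, None becomes null, and the nested list survives intact.

```python
import json

d = {"id": 1, "is_default": True, "evolution": None, "stats": [45, 49]}

s = json.dumps(d)  # Python dict -> JSON text
print(s)           # {"id": 1, "is_default": true, "evolution": null, "stats": [45, 49]}

assert json.loads(s) == d  # JSON text -> Python dict, round trip is lossless here
```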

A JSON Object

Note that there must NOT be a comma after the last key: value pair. This is unlike Python dictionaries where trailing commas are OK.

{
    "class": "Cleric",
    "power source": "Divine",
    "role": "Leader"
}

A JSON Array

Here we show an array of objects, each of which has 3 key-value pairs.

[
    {
        "class": "Cleric",
        "power": "Divine",
        "role": "Leader"
    },
    {
        "class": "Fighter",
        "power": "Martial",
        "role": "Defender"
    },
    {
        "class": "Paladin",
        "power": "Divine",
        "role": "Defender"
    },
    {
        "class": "Ranger",
        "power": "Martial",
        "role": "Striker"
    },
    {
        "class": "Rogue",
        "power": "Martial",
        "role": "Striker"
    },
    {
        "class": "Warlock",
        "power": "Arcane",
        "role": "Striker"
    },
    {
        "class": "Warlord",
        "power": "Martial",
        "role": "Leader"
    },
    {
        "class": "Wizard",
        "power": "Arcane",
        "role": "Controller"
    }
]
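
Parsed with json.loads, an array like the one above becomes a plain list of dictionaries that can be filtered like any other Python data. A sketch using a shortened version of the array:

```python
import json

text = '''[
    {"class": "Cleric",  "power": "Divine",  "role": "Leader"},
    {"class": "Fighter", "power": "Martial", "role": "Defender"},
    {"class": "Rogue",   "power": "Martial", "role": "Striker"}
]'''

classes = json.loads(text)  # JSON array -> list of dicts
strikers = [c["class"] for c in classes if c["role"] == "Striker"]
print(strikers)  # ['Rogue']
```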

JSON and Python

In [32]:
import requests

JSON from REST API

JSON is particularly important for transferring data over the web. More and more databases now allow you to access their data in a granular fashion using REST API calls. We illustrate with retrieval of information about berries that might be used to feed your Pokemon.

Berries are small fruits that can provide HP and status condition restoration, stat enhancement, and even damage negation when eaten by Pokémon. Check out Bulbapedia for greater detail. (Description from PokeAPI.)

In [33]:
response = requests.get('http://pokeapi.co/api/v2/berry/1/')
In [34]:
data = response.json()
In [35]:
data
Out[35]:
{'firmness': {'name': 'soft',
  'url': 'http://pokeapi.co/api/v2/berry-firmness/2/'},
 'flavors': [{'flavor': {'name': 'spicy',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/1/'},
   'potency': 10},
  {'flavor': {'name': 'dry',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/2/'},
   'potency': 0},
  {'flavor': {'name': 'sweet',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/3/'},
   'potency': 0},
  {'flavor': {'name': 'bitter',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/4/'},
   'potency': 0},
  {'flavor': {'name': 'sour',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/5/'},
   'potency': 0}],
 'growth_time': 3,
 'id': 1,
 'item': {'name': 'cheri-berry', 'url': 'http://pokeapi.co/api/v2/item/126/'},
 'max_harvest': 5,
 'name': 'cheri',
 'natural_gift_power': 60,
 'natural_gift_type': {'name': 'fire',
  'url': 'http://pokeapi.co/api/v2/type/10/'},
 'size': 20,
 'smoothness': 25,
 'soil_dryness': 15}

Serializing to disk

In [36]:
import json
In [37]:
filename = 'data/{}.json'.format(data['name'])
filename
Out[37]:
'data/cheri.json'
In [38]:
with open(filename, 'w') as f:
    json.dump(data, f)

De-serializing

In [39]:
with open(filename) as f:
    cheri = json.load(f)
cheri
Out[39]:
{'firmness': {'name': 'soft',
  'url': 'http://pokeapi.co/api/v2/berry-firmness/2/'},
 'flavors': [{'flavor': {'name': 'spicy',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/1/'},
   'potency': 10},
  {'flavor': {'name': 'dry',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/2/'},
   'potency': 0},
  {'flavor': {'name': 'sweet',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/3/'},
   'potency': 0},
  {'flavor': {'name': 'bitter',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/4/'},
   'potency': 0},
  {'flavor': {'name': 'sour',
    'url': 'http://pokeapi.co/api/v2/berry-flavor/5/'},
   'potency': 0}],
 'growth_time': 3,
 'id': 1,
 'item': {'name': 'cheri-berry', 'url': 'http://pokeapi.co/api/v2/item/126/'},
 'max_harvest': 5,
 'name': 'cheri',
 'natural_gift_power': 60,
 'natural_gift_type': {'name': 'fire',
  'url': 'http://pokeapi.co/api/v2/type/10/'},
 'size': 20,
 'smoothness': 25,
 'soil_dryness': 15}

Relational Database (SQLite3)

SQLite3 is part of the standard library. However, the mechanics of using essentially any relational database from Python are similar, thanks to the Python DB-API.
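
The DB-API pattern — connect, execute, fetch, commit, close — can be sketched with a throwaway in-memory database (the table and rows here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
cur = conn.cursor()
cur.execute("CREATE TABLE pokemon (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO pokemon VALUES (?, ?)",
                [(1, "bulbasaur"), (4, "charmander"), (7, "squirtle")])
conn.commit()

rows = cur.execute("SELECT name FROM pokemon ORDER BY id").fetchall()
print(rows)  # [('bulbasaur',), ('charmander',), ('squirtle',)]
conn.close()
```

The same connect/cursor/execute calls work against PostgreSQL, MySQL, Oracle, and so on, via their respective DB-API drivers; mostly only the connect call changes.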

In [40]:
import sqlite3
In [41]:
c = sqlite3.connect('data/Chinook_Sqlite.sqlite')

SQLite specific commands to get metadata

Unlike SQL syntax for queries, how you get metadata from a relational database is vendor-specific. You’ll have to read the docs to find out what is needed for your SQL flavor.

What tables are there in the database?

In [42]:
list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))
Out[42]:
[('Album',),
 ('Artist',),
 ('Customer',),
 ('Employee',),
 ('Genre',),
 ('Invoice',),
 ('InvoiceLine',),
 ('MediaType',),
 ('Playlist',),
 ('PlaylistTrack',),
 ('Track',)]

What are the columns of the table “Album”?

In [43]:
list(c.execute("PRAGMA table_info(Album);"))
Out[43]:
[(0, 'AlbumId', 'INTEGER', 1, None, 1),
 (1, 'Title', 'NVARCHAR(160)', 1, None, 0),
 (2, 'ArtistId', 'INTEGER', 1, None, 0)]

Standard SQL statements with parameter substitution

Note: Interpolating Python values into SQL with string substitution is dangerous because of the risk of SQL injection attacks. Use DB-API parameter substitution with ? instead.

Do this

In [44]:
t = ['%rock%']
list(c.execute("SELECT * FROM Album WHERE Title like ? LIMIT 5;", t))
Out[44]:
[(1, 'For Those About To Rock We Salute You', 1),
 (4, 'Let There Be Rock', 1),
 (59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90)]

Not this

In [45]:
t = ["'%rock%'"]
list(c.execute("SELECT * FROM Album WHERE Title like %s LIMIT 5;" % t[0]))
Out[45]:
[(1, 'For Those About To Rock We Salute You', 1),
 (4, 'Let There Be Rock', 1),
 (59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90)]
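
To see why this matters, here is a sketch (on a made-up in-memory table) of how a crafted input rewrites a %-formatted query, while the ?-bound version treats the same input as an ordinary value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

malicious = "nobody' OR '1'='1"

# String formatting: the quotes in the input change the query's logic,
# turning it into ... WHERE name = 'nobody' OR '1'='1'
leaked = conn.execute(
    "SELECT * FROM users WHERE name = '%s'" % malicious).fetchall()
print(leaked)  # [('alice', 'hunter2')] -- every row leaks

# Parameter binding: the input is just data, and matches nothing.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (malicious,)).fetchall()
print(safe)    # []
```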

Using SQL magic functions

We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.

In [46]:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    %load_ext sql

Configuring the SqlMagic extension

In [47]:
%config SqlMagic
SqlMagic options
--------------
SqlMagic.autolimit=<Int>
    Current: 0
    Automatically limit the size of the returned result sets
SqlMagic.autopandas=<Bool>
    Current: False
    Return Pandas DataFrames instead of regular result sets
SqlMagic.column_local_vars=<Bool>
    Current: False
    Return data into local variables from column names
SqlMagic.displaylimit=<Int>
    Current: 0
    Automatically limit the number of rows displayed (full result set is still
    stored)
SqlMagic.dsn_filename=<Unicode>
    Current: 'odbc.ini'
    Path to DSN file. When the first argument is of the form [section], a
    sqlalchemy connection string is formed from the matching section in the DSN
    file.
SqlMagic.feedback=<Bool>
    Current: True
    Print number of rows affected by DML
SqlMagic.short_errors=<Bool>
    Current: True
    Don't display the full traceback on SQL Programming Error
SqlMagic.style=<Unicode>
    Current: 'DEFAULT'
    Set the table printing style to any of prettytable's defined styles
    (currently DEFAULT, MSWORD_FRIENDLY, PLAIN_COLUMNS, RANDOM)
In [48]:
%config SqlMagic.displaylimit=10

Connect to SQLite3 database

See SQLAlchemy connection strings for how to connect to other databases such as Oracle, MySQL or PostgreSQL.

In [49]:
%sql sqlite:///data/Chinook_Sqlite.sqlite
Out[49]:
'Connected: None@data/Chinook_Sqlite.sqlite'
In [50]:
%sql SELECT * from Album LIMIT 5;
Done.
Out[50]:
AlbumId Title ArtistId
1 For Those About To Rock We Salute You 1
2 Balls to the Wall 2
3 Restless and Wild 2
4 Let There Be Rock 1
5 Big Ones 3
In [51]:
%sql SELECT * from Artist LIMIT 5;
Done.
Out[51]:
ArtistId Name
1 AC/DC
2 Accept
3 Aerosmith
4 Alanis Morissette
5 Alice In Chains
In [52]:
%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;
Done.
Out[52]:
Name Title
AC/DC For Those About To Rock We Salute You
AC/DC Let There Be Rock
Aaron Copland & London Symphony Orchestra A Copland Celebration, Vol. I
Aaron Goldberg Worlds
Academy of St. Martin in the Fields & Sir Neville Marriner The World of Classical Favourites

You can assign results of queries to Python names

In [53]:
result = %sql SELECT * from Album;
Done.
In [54]:
type(result)
Out[54]:
sql.run.ResultSet
In [55]:
result[2:4]
Out[55]:
[(3, 'Restless and Wild', 2), (4, 'Let There Be Rock', 1)]

You can use Python variables in your queries.

Use :varname where you want to use a Python variable in your query.

In [56]:
artist_id = 10

%sql select * from Artist where ArtistId < :artist_id;
Done.
Out[56]:
ArtistId Name
1 AC/DC
2 Accept
3 Aerosmith
4 Alanis Morissette
5 Alice In Chains
6 Antônio Carlos Jobim
7 Apocalyptica
8 Audioslave
9 BackBeat
In [57]:
word = '%rock%'

%sql select * from Album WHERE Title LIKE :word;
Done.
Out[57]:
AlbumId Title ArtistId
1 For Those About To Rock We Salute You 1
4 Let There Be Rock 1
59 Deep Purple In Rock 58
108 Rock In Rio [CD1] 90
109 Rock In Rio [CD2] 90
213 Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK] 139
216 Hot Rocks, 1964-1971 (Disc 1) 142
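
The same named-placeholder style also works outside the magic: the sqlite3 DB-API accepts :varname placeholders with a dictionary of parameters. A sketch on an in-memory table (the rows are made up to mimic the Chinook Artist table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER, Name TEXT)")
conn.executemany("INSERT INTO Artist VALUES (:id, :name)",
                 [{"id": 1, "name": "AC/DC"}, {"id": 10, "name": "Billy Cobham"}])

# :artist_id is filled from the keys of the parameter dictionary.
rows = conn.execute("SELECT Name FROM Artist WHERE ArtistId < :artist_id",
                    {"artist_id": 10}).fetchall()
print(rows)  # [('AC/DC',)]
```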

Convert to pandas dataframe

In [58]:
df = result.DataFrame()
df.head(5)
Out[58]:
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1
4 5 Big Ones 3